 vulnerable statement


Weakly Supervised Vulnerability Localization via Multiple Instance Learning

arXiv.org Artificial Intelligence

Software vulnerability detection has recently emerged as a significant concern in software security, attracting the attention of numerous researchers and developers. Most previous approaches focus on coarse-grained vulnerability detection, such as at the function or file level. However, developers still face the challenge of manually inspecting a large volume of code inside the vulnerable function to identify the specific vulnerable statements to modify, which indicates the importance of vulnerability localization. Training a model for vulnerability localization usually requires ground-truth labels at the statement level, and labeling vulnerable statements demands expert knowledge, which incurs high costs. Hence, there is growing demand for an approach that eliminates the need for additional statement-level labeling. To tackle this problem, we propose a novel approach called WAVES, for WeAkly supervised Vulnerability Localization via multiplE inStance learning, which does not need additional statement-level labels during training. WAVES can determine whether a function is vulnerable (i.e., vulnerability detection) and pinpoint the vulnerable statements (i.e., vulnerability localization). Specifically, inspired by the concept of multiple instance learning, WAVES converts the ground-truth label at the function level into pseudo labels for individual statements, eliminating the need for additional statement-level labeling. These pseudo labels are utilized to train the classifiers for the function-level representation vectors. Extensive experimentation on three popular benchmark datasets demonstrates that, in comparison to previous baselines, our approach achieves comparable performance in vulnerability detection and state-of-the-art performance in statement-level vulnerability localization.
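The multiple instance learning idea in this abstract can be illustrated with a small sketch. It treats a function as a "bag" of statements and uses the standard MIL assumption that a bag is positive when at least one instance is positive. The function names `bag_score` and `pseudo_labels`, the max-pooling rule, and the `top_k` parameter are illustrative assumptions, not the procedure WAVES actually uses.

```python
from typing import List


def bag_score(statement_scores: List[float]) -> float:
    """Function-level score under the standard MIL assumption:
    a function is as suspicious as its most suspicious statement."""
    return max(statement_scores)


def pseudo_labels(statement_scores: List[float],
                  function_label: int,
                  top_k: int = 1) -> List[int]:
    """Turn one function-level label into per-statement pseudo labels.

    Assumed rule (hypothetical, for illustration only):
    - non-vulnerable function: every statement is labeled 0;
    - vulnerable function: the top_k highest-scoring statements are labeled 1.
    """
    if function_label == 0:
        return [0] * len(statement_scores)
    ranked = sorted(range(len(statement_scores)),
                    key=lambda i: statement_scores[i],
                    reverse=True)
    chosen = set(ranked[:top_k])
    return [1 if i in chosen else 0 for i in range(len(statement_scores))]


if __name__ == "__main__":
    scores = [0.1, 0.8, 0.3]          # made-up statement scores
    print(bag_score(scores))          # function-level score
    print(pseudo_labels(scores, 1))   # pseudo labels for a vulnerable function
    print(pseudo_labels(scores, 0))   # pseudo labels for a safe function
```

The point of the sketch is only that statement-level supervision can be derived from a function-level label without extra annotation; how the scores are produced and refined during training is left out.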


Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection

arXiv.org Artificial Intelligence

Recently, pretrained language models have shown state-of-the-art performance on the vulnerability detection task. These models are pretrained on a large corpus of source code, then fine-tuned on a smaller supervised vulnerability dataset. Due to the different training objectives and the performance of the models, it is interesting to consider whether the models have learned the semantics of code relevant to vulnerability detection, namely bug semantics, and if so, how the alignment to bug semantics relates to model performance. In this paper, we analyze the models using three distinct methods: interpretability tools, attention analysis, and interaction matrix analysis. We compare the models' influential feature sets with the bug semantic features which define the causes of bugs, including buggy paths and Potentially Vulnerable Statements (PVS). We find that (1) better-performing models also aligned better with PVS, (2) the models failed to align strongly to PVS, and (3) the models failed to align at all to buggy paths. Based on our analysis, we developed two annotation methods which highlight the bug semantics inside the model's inputs. We evaluated our approach on four distinct transformer models and four vulnerability datasets and found that our annotations improved the models' performance in the majority of settings (11 out of 16), with up to 9.57 points improvement in F1 score compared to conventional fine-tuning. We further found that with our annotations, the models aligned up to 232% better to potentially vulnerable statements. Our findings indicate that it is helpful to provide the model with information about the bug semantics and that the model can attend to it, and they motivate future work in learning more complex path-based bug semantics. Our code and data are available at https://figshare.com/s/4a16a528d6874aad51a0.


Learning to Quantize Vulnerability Patterns and Match to Locate Statement-Level Vulnerabilities

arXiv.org Artificial Intelligence

Deep learning (DL) models have become increasingly popular in identifying software vulnerabilities. Prior studies found that vulnerabilities across different vulnerable programs may exhibit similar vulnerable scopes, implicitly forming discernible vulnerability patterns that can be learned by DL models through supervised training. However, vulnerable scopes still manifest in various spatial locations and formats within a program, posing challenges for models to accurately identify vulnerable statements. Despite this challenge, state-of-the-art vulnerability detection approaches fail to exploit the vulnerability patterns that arise in vulnerable programs. To take full advantage of vulnerability patterns and unleash the ability of DL models, we propose a novel vulnerability-matching approach in this paper, drawing inspiration from program analysis tools that locate vulnerabilities based on pre-defined patterns. Specifically, a vulnerability codebook is learned, which consists of quantized vectors representing various vulnerability patterns. During inference, the codebook is iterated to match all learned patterns and predict the presence of potential vulnerabilities within a given program. Our approach was extensively evaluated on a real-world dataset comprising more than 188,000 C/C++ functions. The evaluation results show that our approach achieves an F1-score of 94% (6% higher than the previous best) and 82% (19% higher than the previous best) for function and statement-level vulnerability identification, respectively. These substantial enhancements highlight the effectiveness of our approach to identifying vulnerabilities. The training code and pre-trained models are available at https://github.com/optimatch/optimatch.
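The codebook matching step described above can be sketched as a nearest-neighbour lookup over quantized vectors. This is a generic vector-quantization illustration, not the released optimatch code: the names `nearest_pattern` and `matches_any_pattern`, the Euclidean distance, and the fixed `threshold` are all assumptions made for the example.

```python
import math
from typing import Sequence, Tuple


def nearest_pattern(embedding: Sequence[float],
                    codebook: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Iterate over the codebook and return (index, distance) of the
    closest pattern vector, using Euclidean distance (an assumption)."""
    best_index, best_dist = -1, math.inf
    for index, pattern in enumerate(codebook):
        dist = math.dist(embedding, pattern)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist


def matches_any_pattern(embedding: Sequence[float],
                        codebook: Sequence[Sequence[float]],
                        threshold: float) -> bool:
    """Flag a statement embedding when it lies within `threshold`
    (a hypothetical hyperparameter) of some learned pattern."""
    _, dist = nearest_pattern(embedding, codebook)
    return dist <= threshold


if __name__ == "__main__":
    # Toy two-dimensional codebook with two made-up pattern vectors.
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    index, dist = nearest_pattern([0.9, 1.0], codebook)
    print(index, round(dist, 3))
    print(matches_any_pattern([5.0, 5.0], codebook, threshold=0.5))
```

In a learned system the codebook vectors and the statement embeddings would both come from the trained model; here they are hand-written only to show the matching loop.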


An Information-Theoretic and Contrastive Learning-based Approach for Identifying Code Statements Causing Software Vulnerability

arXiv.org Artificial Intelligence

Software vulnerabilities in a program or function of a computer system are a serious and crucial concern. Typically, in a program or function consisting of hundreds or thousands of source code statements, only a few statements cause the corresponding vulnerabilities. Vulnerability labeling is currently done at the function or program level by experts with the assistance of machine learning tools. Extending this approach to the code statement level is much more costly and time-consuming and remains an open problem. In this paper, we propose a novel end-to-end deep learning-based approach to identify the vulnerability-relevant code statements of a specific function. Inspired by the specific structures observed in real-world vulnerable code, we first leverage mutual information for learning a set of latent variables representing the relevance of the source code statements to the corresponding function's vulnerability. We then propose novel clustered spatial contrastive learning to further improve the representation learning and the robust selection process of vulnerability-relevant code statements. Experimental results on real-world datasets of 200k+ C/C++ functions show the superiority of our method over other state-of-the-art baselines. In general, our method outperforms the baselines by between 3% and 14% on the VCP, VCA, and Top-10 ACC measures when running on real-world datasets in an unsupervised setting. Our released source code samples are publicly available at https://github.com/vannguyennd/livuitcl.
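The abstract does not define its Top-10 ACC measure, so the following is a sketch of one common reading of top-k accuracy for statement-level localization: a function counts as a hit when at least one truly vulnerable statement appears among its k highest-scoring statements. The function names and the toy data are hypothetical, and the paper's exact definition may differ.

```python
from typing import List, Sequence, Set, Tuple


def top_k_hit(scores: Sequence[float],
              vulnerable_indices: Set[int],
              k: int = 10) -> bool:
    """True when any ground-truth vulnerable statement is ranked in the top k."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return any(i in vulnerable_indices for i in ranked[:k])


def top_k_accuracy(functions: List[Tuple[Sequence[float], Set[int]]],
                   k: int = 10) -> float:
    """Fraction of functions with at least one top-k hit (assumed definition)."""
    hits = sum(top_k_hit(scores, vulnerable, k) for scores, vulnerable in functions)
    return hits / len(functions)


if __name__ == "__main__":
    # Two toy functions: per-statement scores plus the set of vulnerable indices.
    data = [
        ([0.9, 0.1, 0.2], {0}),  # vulnerable statement is ranked first
        ([0.1, 0.2, 0.9], {0}),  # vulnerable statement is ranked last
    ]
    print(top_k_accuracy(data, k=1))
    print(top_k_accuracy(data, k=3))
```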


Robots that admit mistakes foster better conversation in humans

#artificialintelligence

"Sorry, guys, I made the mistake this round," it says. "I know it may be hard to believe, but robots make mistakes too." This scenario occurred multiple times during a Yale-led study of robots' effects on human-to-human interactions. The study, which will publish on March 9 in the Proceedings of the National Academy of Sciences, showed that the humans on teams that included a robot expressing vulnerability communicated more with each other and later reported having a more positive group experience than people teamed with silent robots or with robots that made neutral statements, like reciting the game's score. "We know that robots can influence the behavior of humans they interact with directly, but how robots affect the way humans engage with each other is less well understood," said Margaret L. Traeger, a Ph.D. candidate in sociology at the Yale Institute for Network Science (YINS) and the study's lead author.


How Can We Bond With Robots?

#artificialintelligence

The robot makes a mistake, costing the team a round. Like any good teammate, it acknowledges the error. "Sorry, guys, I made the mistake this round," it says. "I know it may be hard to believe, but robots make mistakes too." This scenario occurred multiple times during a Yale-led study of robots' effects on human-to-human interactions.